
Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.0
    • Component/s: MLlib
    • Labels: None

    Description

      This JIRA is for discussing the ML dataset abstraction, which is essentially a SchemaRDD with extra ML-specific metadata embedded in its schema.
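
To make "metadata embedded in its schema" concrete, here is a minimal sketch in plain Scala. It does not use Spark: `Field`, `Schema`, `withMetadata`, and the `kind` / `numCategories` keys are hypothetical names chosen for illustration, not the SchemaRDD API or the design adopted in this issue.

```scala
// Hypothetical stand-ins for illustration; these are not Spark's SchemaRDD classes.
final case class Field(
    name: String,
    dataType: String,
    metadata: Map[String, String] = Map.empty)

final case class Schema(fields: Vector[Field]) {
  // Look up a column by name.
  def field(name: String): Option[Field] = fields.find(_.name == name)

  // Return a new schema with extra metadata merged into one column.
  def withMetadata(name: String, extra: Map[String, String]): Schema =
    Schema(fields.map { f =>
      if (f.name == name) f.copy(metadata = f.metadata ++ extra) else f
    })
}

object MetadataSketch {
  // An indexed categorical column carries ML-specific hints in its metadata.
  val schema: Schema = Schema(Vector(
    Field("userCountryIndex", "double"),
    Field("label", "double")
  )).withMetadata(
    "userCountryIndex",
    Map("kind" -> "categorical", "numCategories" -> "3"))

  // A learner could consult the metadata instead of guessing from the values.
  def isCategorical(s: Schema, column: String): Boolean =
    s.field(column).exists(_.metadata.get("kind").contains("categorical"))
}
```

Under this assumption, a tree learner could call `MetadataSketch.isCategorical(schema, "userCountryIndex")` to decide whether to treat a numeric column as categorical.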

      Sample code

      Suppose we have training events stored on HDFS and user/ad features in Hive, and we want to assemble features for training and then apply a decision tree.
      The proposed pipeline using the dataset looks like the following (it needs more refinement):

      sqlContext.jsonFile("/path/to/training/events", 0.01).registerTempTable("event")
      val training = sqlContext.sql("""
        SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, event.action AS label,
               user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures,
               ad.targetGender AS targetGender
          FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id""").cache()
      
      val indexer = new Indexer()
      val interactor = new Interactor()
      val fvAssembler = new FeatureVectorAssembler()
      val treeClassifier = new DecisionTreeClassifier()
      
      val paramMap = new ParamMap()
        .put(indexer.features, Map("userCountryIndex" -> "userCountry"))
        .put(indexer.sortByFrequency, true)
        .put(interactor.features, Map("genderMatch" -> Array("userGender", "targetGender")))
        .put(fvAssembler.features, Map("features" -> Array("genderMatch", "userCountryIndex", "userFeatures")))
        .put(fvAssembler.dense, true)
        .put(treeClassifier.maxDepth, 4) // By default, the classifier recognizes the "features" and "label" columns.
      
      val pipeline = Pipeline.create(indexer, interactor, fvAssembler, treeClassifier)
      val model = pipeline.fit(training, paramMap)
      
      sqlContext.jsonFile("/path/to/events", 0.01).registerTempTable("event")
      val test = sqlContext.sql("""
        SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId,
               user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures,
               ad.targetGender AS targetGender
          FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id""")
      
      val prediction = model.transform(test).select('eventId, 'prediction)
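
The fit-then-transform flow in the sample can be sketched without Spark. The following is a simplified, hypothetical model of the idea, not the proposed API: `Dataset` is just a sequence of maps, `Indexer` handles a single column, and two behaviours are assumptions made for this sketch (the most frequent value gets index 0, and unseen values map to -1).

```scala
// Hypothetical, simplified stand-ins for the proposed API; not Spark classes.
object PipelineSketch {
  type Row = Map[String, Any]
  type Dataset = Seq[Row]

  trait Transformer { def transform(data: Dataset): Dataset }
  trait Estimator { def fit(data: Dataset): Transformer }

  // Learns a value -> index mapping for one column.
  // Assumption: the most frequent value gets index 0; ties break by string order.
  final class Indexer(input: String, output: String) extends Estimator {
    def fit(data: Dataset): Transformer = {
      val index: Map[Any, Int] = data
        .map(row => row(input))
        .groupBy(identity)
        .toSeq
        .sortBy { case (value, rows) => (-rows.size, value.toString) }
        .map(_._1)
        .zipWithIndex
        .toMap
      new Transformer {
        // Assumption: values not seen during fit are mapped to -1.
        def transform(d: Dataset): Dataset =
          d.map(row => row + (output -> index.getOrElse(row(input), -1)))
      }
    }
  }

  // Chains stages: each stage is fit on the output of the stages before it.
  final class Pipeline(stages: Seq[Estimator]) extends Estimator {
    def fit(data: Dataset): Transformer = {
      val (fitted, _) = stages.foldLeft((Vector.empty[Transformer], data)) {
        case ((models, current), stage) =>
          val model = stage.fit(current)
          (models :+ model, model.transform(current))
      }
      new Transformer {
        // Apply the fitted stages in the same order to new data.
        def transform(d: Dataset): Dataset =
          fitted.foldLeft(d)((acc, m) => m.transform(acc))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val training: Dataset = Seq(
      Map("userCountry" -> "US"),
      Map("userCountry" -> "US"),
      Map("userCountry" -> "DE"))
    val model = new Pipeline(Seq(new Indexer("userCountry", "userCountryIndex"))).fit(training)
    model.transform(Seq(Map("userCountry" -> "DE"))).foreach(println)
  }
}
```

The point of the design is that the fitted pipeline is itself a transformer, so the same chain of feature steps learned on the training data is replayed on the test data.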
      

          People

            Xiangrui Meng (mengxr)
            Michael Armbrust